Introduction to Open Data Science - Course Project

Chapter 1 - Intro

Seppo wRites some notes heRe

Fri Oct 30, 2020

Here is the link: https://github.com/avaruusseppo/IODS-project

I have some experience with R already, but I'm not yet too familiar with R Markdown. Thanks to Kimmo for the hint about R bookdown, and for the peppy warm-up speech!

# Let us do something in this "R chunk".

plot(sin,from=-2,to=2,xlab="days after course start",ylab = "enthusiasm")

A small gRaph of my musings.


Chapter 2 - Data analysis exercise

Fri 06 Nov 2020

i) Loading a prepared data set and examining its contents

Loading a prepared csv table, containing data from a learning study (International survey of Approaches to Learning, Kimmo Vehkalahti, fall 2016). Metainformation: https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-meta.txt

lrn14<-read.csv("data/learning2014.csv")
print(dim(lrn14))
## [1] 166   7
str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

A quick overview of the contents in this data:

This is a learning study on statistics students, examining the factors that might relate to a high or low exam score (points) for students of different ages and genders.

The students were interviewed about their attitude towards statistics and asked various motivational questions about their modes of learning.

The mode-related questions are grouped into three categories (deep, surface and strategic); each category score is the average of its numeric questionnaire items.

By examining the structure, we can see that the data has 166 observations in 7 variable columns.

  • Three background attributes given by the informants: attitude estimate, age and gender.
  • Three pre-processed learning factors: surface, deep and strategic.
  • The "points" variable holds the exam points, i.e. the output score. (Note: rows with a zero exam score have been filtered out.)

ii) An overview of the data

The summary() function for a data frame in R shows the range in which each variable's values lie.

# This shows summaries of the variables in the data
summary(lrn14)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

The learning mode scores (deep, stra, surf) are averages of Likert-style (1-5) questionnaire items. Attitude is a subjective measure of interest towards learning statistics, and age is the student's age in years.

The exam points (used as the output variable in this study) vary from 7 to 33. Looking at the 1st and 3rd quartiles in the summary output, the score distribution appears fairly symmetric, so it could plausibly follow a Gaussian bell curve.
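That kind of hunch can be checked with a Shapiro-Wilk test or a Q-Q plot. A minimal sketch on synthetic scores; with the real data one would pass lrn14$points instead (the mean and sd below are just taken from the summary above for illustration):

```r
# Sketch: checking rough normality of an exam score vector.
# Synthetic scores are used here; replace with lrn14$points for the real check.
set.seed(1)
scores <- rnorm(166, mean = 22.7, sd = 5)

# Shapiro-Wilk test: a large p-value gives no evidence against normality
shapiro.test(scores)$p.value

# Visual check: quantile-quantile plot against the normal distribution
qqnorm(scores); qqline(scores)
```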

Simple plot with linear fitting

Let's quickly try a scatter plot matching attitude against exam points. This is roughly what the Excel chart wizard does (but nothing more).

# We need the 'ggplot2' graphics package and ggpairs (from GGally)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# A scatter plot of students' attitude and exam points
qplot(attitude, points, col = gender, data = lrn14) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Examining the relationship between attitude and scores graphically, we find a positive correlation: a better attitude predicts a higher score, as expected. We can also see that gender is not much of a factor in the course scores.

As a second example, we might be especially interested in how the different learning modes affect the score. For instance, strategic learning looks promising: the fitted linear model slopes towards the upper right-hand corner, indicating a positive correlation.

# A scatter plot of students' strategic approach against points, fitted by gender
qplot(stra, points, col = gender, data = lrn14) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Now, instead of examining a single variable's effect and distribution, we might want a graphical overview of them all, using GGally's ggpairs():

ggpairs(lrn14, lower = list(combo = wrap("facethist", bins = 20)))

iii) Fitting a linear model with function lm()

We continue by building a linear model with multiple variables. The model predicts the output variable (exam score) from the input variables.

After examining the matrix plot above, I choose three variables (age, attitude and strategic approach) as explanatory input variables. A regression model is fitted with these variables, where exam points is the dependent (output) variable.

Below is a summary of the fitted model, with comments interpreting the R output.

# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra + age, data = lrn14)

The parameter "points ~ attitude + stra + age" is a so-called formula. The dependent variable is stated on the left-hand side of the tilde ("~"), and the explanatory variables are listed on the right-hand side. In this model, the explanatory variables are assumed to be independent of each other.
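The same formula pattern can be tried on R's built-in mtcars data; this is just a sketch to show the mechanics, independent of the course data:

```r
# Formula demo on built-in mtcars: mpg is the dependent variable,
# wt and hp are the explanatory variables on the right of the tilde.
fit <- lm(mpg ~ wt + hp, data = mtcars)

coef(fit)    # intercept plus one slope per explanatory variable
class(fit)   # "lm"
```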

# print out a summary of the model
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra + age, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## attitude     0.34808    0.05622   6.191 4.72e-09 ***
## stra         1.00371    0.53434   1.878   0.0621 .  
## age         -0.08822    0.05302  -1.664   0.0981 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

The summary() function for an lm() model shows estimates, standard errors and test statistics for the coefficients, which correspond to the input variables of our model.

First, R outputs a summary of the residual distribution. The residuals are the errors between the estimated and observed output values (points).

The first listed coefficient (the intercept) is a baseline: the predicted output when all inputs are zero. Each following coefficient is the "slope" of the fitted linear estimate, i.e. the expected change in the output per unit change in that input variable.

For each coefficient, R prints a t-statistic (the so-called Student's test) and a p-value, which reflects the probability of seeing an effect this large if the true coefficient were zero. R also pretty-prints significance codes: e.g. "***" for strong significance (p < 0.001) and "." for weaker significance (0.05 < p < 0.1). (Most social studies use the 95 percent confidence level as the baseline, but we can accept this model for now.)

The F-statistic tests the model as a whole against an intercept-only model. The smaller its p-value, the stronger the evidence for the alternative hypothesis; in other words, a small p-value indicates that the model captures a meaningful statistical relationship.

The R squared is the fraction of variance explained by the model: if this value is large, the model fits well and there is little need to search for additional explanatory variables.
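The "fraction of variance explained" reading can be made concrete by computing R squared by hand from the residuals. A sketch on built-in mtcars (since my_model2 needs the course data):

```r
fit <- lm(mpg ~ wt, data = mtcars)

ss_res <- sum(residuals(fit)^2)                   # unexplained variation
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total variation
r2 <- 1 - ss_res / ss_tot

all.equal(r2, summary(fit)$r.squared)             # TRUE
```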

iv) Understanding the fitted model output

The multiple R squared of the model is 0.2182. I would have expected a somewhat higher R squared; this suggests there are still more descriptive variables "out there" that could be added to the model. Still, the R squared is clearly positive, so the model does better than fitting a horizontal line at the mean.

In conclusion, this model can be understood so that the attitude variable has a very significant relationship with the exam score, with a p-value < 0.001. The evidence for this alternative hypothesis is very strong.

The age and strategic learning mode also have a somewhat significant relationship with the output, since their p-values (under the null hypothesis of no effect) are less than 0.1. It seems a higher age predicts a slightly lower exam score, while strategic learning yields some extra exam points. Such effects might be worth further study.

v) Some useful plots on the fitted model

R has this convenience function for plotting lm() models:

plot(my_model2,c(1,2,5))

These graphs help us evaluate how appropriate our statistical model is for estimating the effect of the students' age, attitude towards statistics, and adoption of strategic learning methods.

The Residuals vs fitted values graph gives a general view of how well this linear model fits the data. If the residuals are uniformly scattered across the range of fitted values, we can be fairly comfortable with the coefficients.

The Normal Q-Q plot is a quick visualization of how well the residual distribution follows the Gaussian distribution. If most of the values lie along the dotted diagonal, we can be fairly confident there is no remaining systematic "explaining" variable to examine.

The Residuals vs leverage graph is a useful tool for finding outliers in a model. An outlier is a data point whose output value is exceptionally far from the estimate produced from the input variables. In other words, any points beyond the dotted red line would have high influence on the fitted model. Here, none of the points seem to have especially high influence on the model coefficients, and the residuals are quite normally distributed.
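Influence can also be inspected numerically with cooks.distance(); a common rule of thumb flags points above 4/n. A sketch on built-in mtcars, since my_model2 needs the course data:

```r
# Cook's distance measures how much each observation pulls the fit
fit <- lm(mpg ~ wt + hp, data = mtcars)
cd  <- cooks.distance(fit)

n <- nrow(mtcars)
which(cd > 4 / n)   # indices of potentially influential observations
```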


Chapter 3 - Logistic Regression

Fri 13 Nov 2020

Loading the alcohol data set

From data/alc.csv file:

alc<-read.csv("data/alc.csv")
print(dim(alc))
## [1] 382  35

A quick overview of the contents in this data: This is a student performance vs alcohol usage survey data set. The data is joined from two course surveys, with scores averaged, covering 382 distinct students who took both the Portuguese and math courses. A numeric column alc_use was added for average alcohol consumption (Likert scale 1…5), along with a logical column high_use (TRUE if alc_use is higher than 2).

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The data contains school performance scores (grades after periods 1-3), absences, alcohol use, and background data such as school name, gender, age and home location type. There are also numeric Likert-scale estimates (values 1 to 5) of family relations, free-time activity, going out and health. More information at: https://archive.ics.uci.edu/ml/datasets/Student+Performance (NB: the alcohol use columns, average and logical high consumption, are additions.)

Hypothesis about high alcohol consumption

Alcohol might be consumed at a high rate if the initial G1 grades are low and family relations are poor. Male gender and higher age might also be associated with high alcohol usage.

A graphical summary:

library(GGally)
ggpairs(alc,
        columns = c('alc_use','age','sex','famrel','G1'),
        lower = list(combo = wrap("facethist", bins = 30)))

Some plots on these factors. First, a box plot of the initial grades by high alcohol consumption:

g1 <- ggplot(alc, aes(x = high_use, y = G1, col = sex))
g1 + geom_boxplot() + ylab("G1") + ggtitle("Student initial grades by high alcohol consumption and gender")

It seems the initial grades of high alcohol users are a bit lower, for both genders.

Then let’s see the age distribution within high and low users.

# initialize a plot of 'high_use'
g2 <- ggplot(alc, aes(x=(age), fill=sex))

g2 + facet_wrap("high_use")  + geom_bar() + ggtitle("age distributions by high alcohol usage and gender")

It seems the age distribution might differ noticeably for the high alcohol consumers.

g1 <- ggplot(alc, aes(x = high_use, y = famrel, col = sex))
g1 + geom_boxplot() + ylab("famrel")+ggtitle("Family relation estimate by high alcohol consumption and gender")

These look promising for the hypothesized model.


Logistic regression

Now I fit a logistic regression with the binomial family (high alcohol consumers vs. the control group).

# find the model with glm()
m <- glm(high_use ~ age + sex + G1 + famrel, data = alc, family = "binomial")

summary(m)
## 
## Call:
## glm(formula = high_use ~ age + sex + G1 + famrel, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4499  -0.8193  -0.6569   1.1384   2.1599  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.9968     1.8764  -1.064   0.2872    
## age           0.1917     0.1007   1.903   0.0571 .  
## sexM          0.9436     0.2378   3.967 7.26e-05 ***
## G1           -0.1141     0.0447  -2.553   0.0107 *  
## famrel       -0.3207     0.1262  -2.542   0.0110 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 432.29  on 377  degrees of freedom
## AIC: 442.29
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)         age        sexM          G1      famrel 
##  -1.9968299   0.1916833   0.9435755  -0.1141214  -0.3207391

All the coefficients seem to be somewhat relevant to predicting high usage. They are more easily interpreted as odds ratios, which are the exponentials of the coefficients.
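The log-odds to odds-ratio conversion is plain arithmetic: a coefficient b means the odds are multiplied by exp(b) for each unit increase in that variable. Using the estimates printed in the summary above:

```r
# sexM coefficient from the model summary above: being male multiplies
# the odds of high use by about 2.57
b_sexM <- 0.9435755
exp(b_sexM)

# famrel coefficient: each extra point of family relations multiplies
# the odds by about 0.73, i.e. cuts them by roughly 27%
b_famrel <- -0.3207391
exp(b_famrel)
```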

Odds ratios and their confidence intervals:

# compute odds ratios (OR); the pipe (%>%) comes from dplyr,
# which is also needed for mutate() and select() below
library(dplyr)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR       2.5 %    97.5 %
## (Intercept) 0.1357650 0.003360881 5.3701291
## age         1.2112868 0.995467029 1.4791035
## sexM        2.5691511 1.620172718 4.1227535
## G1          0.8921496 0.815955982 0.9726699
## famrel      0.7256125 0.565422154 0.9289606

Analysis of the fitted model: Good family relations seem to reduce the odds of high alcohol consumption. Higher age and male gender appear to increase them, while higher first-period grades hint at a lower probability of high consumption.

All in all, we’ll keep these coefficients as the factors of our prediction.

Cross tabulation of a prediction model

First, I create a prediction and probability column, based on the fitting.

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

Adding the predicted probabilities to ‘alc’ and using the probabilities to make a prediction of high_use:

alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)

To understand and examine the prediction, let’s see the last ten original classes, predicted probabilities, and “high use” class predictions:

select(alc, age, sex, G1, famrel,high_use, probability, prediction) %>% tail(10)
##     age sex G1 famrel high_use probability prediction
## 373  19   M  6      4    FALSE   0.6504559       TRUE
## 374  18   M  6      5     TRUE   0.5271286       TRUE
## 375  18   F 12      5    FALSE   0.1795082      FALSE
## 376  18   F  6      4    FALSE   0.3742059      FALSE
## 377  19   F  8      5    FALSE   0.2949394      FALSE
## 378  18   F 11      4    FALSE   0.2525945      FALSE
## 379  18   F  6      2    FALSE   0.5317729       TRUE
## 380  18   F  8      1    FALSE   0.5547198       TRUE
## 381  17   M 12      2     TRUE   0.5484541       TRUE
## 382  18   M 10      4     TRUE   0.4932191      FALSE

Now let’s tabulate the target variable versus the predictions:

table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   249   19
##    TRUE     93   21

We can be fairly OK with this. The model identifies most non-high-users correctly, although it misses a good share of the actual high users; there is always a bit of uncertainty. A graphical evaluation of the probability and prediction results is below:

g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point() + ggtitle("cross-tabulation of prediction and high usage, with the probability parameter")
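The cross tabulation can be turned into summary rates. The counts below are copied from the table above (249, 19, 93, 21):

```r
# Confusion matrix counts from the cross tabulation above
tn <- 249  # correctly predicted non-high-users
fp <- 19   # predicted high, actually not
fn <- 93   # predicted not high, actually high
tp <- 21   # correctly predicted high users

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # ~0.71
sensitivity <- tp / (tp + fn)                   # ~0.18, poor at catching high users
specificity <- tn / (tn + fp)                   # ~0.93, good on the negatives
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

The high specificity but low sensitivity means the model mostly predicts "not a high user", and catches only a small share of the actual high users; that nuance is worth keeping in mind alongside the overall error rate.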

10-fold cross-validation of the prediction

First, let's define a loss function (the average proportion of wrong predictions).

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
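As a quick sanity check of the loss function on toy vectors (the TRUE/FALSE classes coerce to 1/0 in the subtraction; the function is restated here so the snippet is self-contained):

```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Two of the three predictions land on the correct side of 0.5,
# so one out of three is wrong:
loss_func(class = c(TRUE, FALSE, TRUE), prob = c(0.9, 0.2, 0.3))  # 0.333...
```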

loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2931937

The average proportion of wrong predictions in the (training) data is around 0.29.

Next, we’ll do a ten-fold cross-validation with resampling

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

cv$delta[1]
## [1] 0.3010471

The average proportion of wrong predictions in the cross-validation is not especially small (about 3 out of 10), but we can be quite happy with this.


Chapter 4 - Clustering and classification

Seppo Nyrkkö Fri Nov 20, 2020

i. Load the Boston data from the MASS package.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select

ii. The Boston data

is about housing values in the suburbs of Boston, and comes along with the R MASS data set collection.

( Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102. – Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley. )

# structure and the dimensions of the Boston data
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The data has 506 observations of 14 variables, all numeric.

iii. Graphical analysis

# let's take only a sample; the full data takes some CPU cycles and produces a grid of tiny plots
ggpairs(Boston[sample(seq(1,nrow(Boston)),size=50),])

Some correlation can be seen in the scatter plots. Few of the distributions are normal; most are skewed or bimodal ("double-bumped"), suggesting clusters within the variables.

# let us try the cor() matrix and corrplot()
library(corrplot)
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)

The correlation plot shows a few interesting positive correlations (radial highway access and property tax, nitrogen oxides and industrial acreage) and negative correlations (between industrial acreage and distance to central areas, and between median home values and the lower-status share of the population). This looks reasonable.

iv. Normalize the dataset

# print out summaries of the scaled data. 

boston_sc <- as.data.frame(scale(Boston))

summary(boston_sc)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

The means are all zero, as expected, and each variable now has unit variance.
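What scale() does is subtract each column's mean and divide by its standard deviation; a minimal check on a toy vector:

```r
# Standardize a toy vector: subtract the mean, divide by the sd
x <- c(2, 4, 6, 8)
z <- as.vector(scale(x))

mean(z)  # 0
sd(z)    # 1
```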

Now let's compute quartile break points from the scaled crime rate and bin the observations into a categorical crime variable.

bins = quantile(boston_sc$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
labels = c("low", "med_low", "med_high", "high")
crime <- cut(boston_sc$crim,
             breaks=bins,
             labels=labels,
             include.lowest=TRUE)
boston_sc <- dplyr::select(boston_sc, -crim)
boston_sc <- data.frame(boston_sc, crime)
str(boston_sc)
## 'data.frame':    506 obs. of  14 variables:
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
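The quantile() + cut() pattern used above can be seen on a toy vector; include.lowest = TRUE makes sure the minimum value falls into the first bin:

```r
# Bin a toy vector into quartile classes, as done for the crime rate above
x <- 1:100
bins <- quantile(x)   # min, 25%, 50%, 75%, max

f <- cut(x, breaks = bins,
         labels = c("low", "med_low", "med_high", "high"),
         include.lowest = TRUE)
table(f)   # 25 observations per class
```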

We have a crime factor, great. Now divide the data set into train and test sets: 80% goes into training.

ind <- sample(seq(1,nrow(boston_sc)), size=nrow(boston_sc) * 0.8)
train <- boston_sc[ind,]
test <- boston_sc[-ind,]
str(train)
## 'data.frame':    404 obs. of  14 variables:
##  $ zn     : num  -0.487 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  1.2307 1.015 -0.0797 1.015 1.015 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  2.73 0.253 -0.567 1.366 1.366 ...
##  $ rm     : num  -0.22 -0.637 -1.242 0.661 0.325 ...
##  $ age    : num  1.116 -0.315 -2.088 0.854 0.758 ...
##  $ dis    : num  -1.1283 -0.8536 -0.0986 -0.6988 -0.4718 ...
##  $ rad    : num  -0.522 1.66 -0.637 1.66 1.66 ...
##  $ tax    : num  -0.0311 1.5294 -0.7787 1.5294 1.5294 ...
##  $ ptratio: num  -1.7347 0.8058 0.0667 0.8058 0.8058 ...
##  $ black  : num  -2.0129 -3.6368 -0.0848 -3.9033 0.4069 ...
##  $ lstat  : num  2.121 0.425 2.366 0.67 -0.331 ...
##  $ medv   : num  -0.95 -1.341 0.127 -0.993 -0.254 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 3 4 3 4 4 3 1 2 3 1 ...
dim(train)
## [1] 404  14
dim(test)
## [1] 102  14

train and test sets seem OK.

v. LDA on the train

We use the categorical crime rate as the target variable and all the other variables as predictors.

model = lda(crime ~ ., data=train)
model
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2400990 0.2500000 0.2549505 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       1.01470844 -0.9255351 -0.08120770 -0.8741886  0.4589313 -0.8809319
## med_low  -0.05250139 -0.3033830 -0.02879709 -0.5584026 -0.1600575 -0.3174454
## med_high -0.39299539  0.1512138  0.23442641  0.3676461  0.1078278  0.4465070
## high     -0.48724019  1.0149946 -0.11943197  1.0475064 -0.4322721  0.8160792
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8284732 -0.6741266 -0.7614022 -0.45033515  0.3882932 -0.78281788
## med_low   0.3430508 -0.5343243 -0.4311818 -0.06184179  0.3462609 -0.10292896
## med_high -0.3624116 -0.4326538 -0.3372617 -0.26575060  0.0506099  0.03122454
## high     -0.8549162  1.6596029  1.5294129  0.80577843 -0.6864810  0.95680002
##                 medv
## low       0.56731761
## med_low  -0.01533676
## med_high  0.15931248
## high     -0.75770901
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.13649114  0.73254281 -0.97335467
## indus    0.01846740 -0.20847717  0.18932245
## chas    -0.10036190 -0.03263489  0.04589683
## nox      0.25420070 -0.82742185 -1.30144391
## rm      -0.10876829 -0.09078336 -0.11248719
## age      0.30109575 -0.41599055 -0.16053993
## dis     -0.09089783 -0.46500627  0.26712663
## rad      3.71217167  1.01214889 -0.38921410
## tax      0.06904740 -0.10316652  1.00807073
## ptratio  0.16164882 -0.01573812 -0.34604657
## black   -0.14328608  0.03683315  0.17387118
## lstat    0.22709150 -0.28240835  0.34950696
## medv     0.20496330 -0.45530121 -0.20603988
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9583 0.0317 0.0100

Draw the LDA (bi)plot. Use color from the crime factor index.

classes = as.numeric(train$crime)
plot(model, dimen=2, col=classes, pch=classes)

It seems the highest crime quartile is separable from the other clusters. We can proceed to the model testing part.

vi. Testing the prediction

# remove the crime factor from the test set
correct_classes = test$crime
test <- dplyr::select(test, -crime)
str(correct_classes)
##  Factor w/ 4 levels "low","med_low",..: 1 1 2 2 2 3 2 2 2 2 ...
str(test)
## 'data.frame':    102 obs. of  13 variables:
##  $ zn     : num  -0.4872 -0.4872 0.0487 0.0487 0.0487 ...
##  $ indus  : num  -0.593 -1.306 -0.476 -0.476 -0.476 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.74 -0.834 -0.265 -0.265 -0.265 ...
##  $ rm     : num  1.281 1.015 -0.16 0.131 -0.563 ...
##  $ age    : num  -0.266 -0.809 0.978 0.914 -1.051 ...
##  $ dis    : num  0.557 1.077 1.024 1.212 0.786 ...
##  $ rad    : num  -0.867 -0.752 -0.522 -0.522 -0.522 ...
##  $ tax    : num  -0.986 -1.105 -0.577 -0.577 -0.577 ...
##  $ ptratio: num  -0.303 0.113 -1.504 -1.504 -1.504 ...
##  $ black  : num  0.396 0.416 0.441 0.393 0.371 ...
##  $ lstat  : num  -1.208 -1.36 0.91 1.092 0.428 ...
##  $ medv   : num  1.3229 1.1816 0.4966 -0.819 -0.0906 ...

The test set no longer contains the correct classes, as intended.

Prediction of crime categories with LDA using the prepared test data:

predictions = predict(model, newdata=test)

# Cross tabulate the results with the crime categories from the test set. 
table(correct=correct_classes, predicted=predictions$class)
##           predicted
## correct    low med_low med_high high
##   low       15       9        0    0
##   med_low    4      17        8    0
##   med_high   0       5       18    2
##   high       0       0        1   23

This produces quite a fair confusion matrix with a strong diagonal. Some confusion between low and med_low is apparent; otherwise the result is very good.
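The same lda + predict + table workflow can be reproduced end to end on R's built-in iris data; this is a self-contained sketch, independent of the Boston split above:

```r
library(MASS)

# 80/20 split of iris, mirroring the Boston workflow above
set.seed(42)
idx        <- sample(nrow(iris), size = 0.8 * nrow(iris))
iris_train <- iris[idx, ]
iris_test  <- iris[-idx, ]

fit  <- lda(Species ~ ., data = iris_train)
pred <- predict(fit, newdata = iris_test)

# Confusion matrix and overall accuracy
cm <- table(correct = iris_test$Species, predicted = pred$class)
sum(diag(cm)) / sum(cm)
```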

vii. k-means

Let’s reload and standardize the Boston dataset

data("Boston")

# Scale the variables to mean zero and unit variance

boston_sc <- as.data.frame(scale(Boston))

# This is one of my favorite methods in R. (besides the heatmap)

dist_eu <- dist(boston_sc, method="euclidean")
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# Running k-means algorithm on the dataset. 
# What is the optimal number of clusters? Try 4 ...

km <- kmeans(boston_sc, centers=4)

# Visualize the clusters with a few variables as pairs
pairs(boston_sc[c('tax','dis','age',
                  'nox','indus')], col=km$cluster)

Ok. Some clusters do seem to be found, since the cluster colors clearly relate to the xy-positions in the plot.
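As a sanity check of the idea (a sketch on synthetic data, not part of the Boston exercise), k-means recovers two well-separated point clouds, which is the kind of structure the colored pairs plot hints at:

```r
set.seed(1)
blob1 <- matrix(rnorm(40, mean = 0), ncol = 2)   # 20 points around (0, 0)
blob2 <- matrix(rnorm(40, mean = 8), ncol = 2)   # 20 points around (8, 8)
toy <- rbind(blob1, blob2)

km_toy <- kmeans(toy, centers = 2)

# each blob ends up in its own cluster
table(blob = rep(1:2, each = 20), cluster = km_toy$cluster)
```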

Let’s run the k-means again to find an optimal number of k clusters.

# boston_sc dataset is available
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_sc, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

The total within-cluster sum of squares drops sharply at 2 clusters, so 2 seems to be the optimal number.
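The same elbow logic can be demonstrated on synthetic data with a known cluster count (an illustrative sketch, not course material): the total WCSS collapses once k reaches the true number of clusters and only creeps down after that.

```r
set.seed(42)
# two clearly separated clusters
toy <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
             matrix(rnorm(60, mean = 10), ncol = 2))

# total within-cluster sum of squares for k = 1..4
twcss_toy <- sapply(1:4, function(k) kmeans(toy, centers = k, nstart = 5)$tot.withinss)
round(twcss_toy)  # large drop from k = 1 to k = 2, small changes afterwards
```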

Let us do the clustering again with centers=2

# k-means clustering
km <-kmeans(boston_sc, centers = 2)


# plot the Boston dataset with clusters
pairs(boston_sc, col=km$cluster)

This looks pretty ok. For a clearer picture, let’s look at the pairs plot with the same variables as before:

# plot the Boston dataset with clusters
pairs(boston_sc[c('tax','dis','age',
                  'nox','indus')], col=km$cluster)

We can be pretty happy with this. There are two distinct clusters. Some bleeding between them can still be seen in some pairs, but the other dimensions show a very distinct split between the clusters.

extra: Arrows

# Reload the original Boston dataset
data("Boston")

# Scale the variables again to zero mean and unit variance

boston_sc <- as.data.frame(scale(Boston))

set.seed(123)

# 3 clusters using k-means for targets
km <- kmeans(boston_sc, centers=3)

#  LDA using the km clusters 
lda.fit <- lda(x=boston_sc, grouping=km$cluster)

# Examine the object
lda.fit
## Call:
## lda(boston_sc, grouping = km$cluster)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.2806324 0.3992095 0.3201581 
## 
## Group means:
##         crim         zn        indus        chas         nox         rm
## 1  0.9693718 -0.4872402  1.074440092 -0.02279455  1.04197430 -0.4146077
## 2 -0.3549295 -0.4039269  0.009294842  0.11748284  0.01531993 -0.2547135
## 3 -0.4071299  0.9307491 -0.953383032 -0.12651054 -0.93243813  0.6810272
##          age        dis        rad        tax     ptratio      black
## 1  0.7666895 -0.8346743  1.5010821  1.4852884  0.73584205 -0.7605477
## 2  0.3096462 -0.2267757 -0.5759279 -0.4964651 -0.09219308  0.2473725
## 3 -1.0581385  1.0143978 -0.5976310 -0.6828704 -0.53004055  0.3582008
##         lstat       medv
## 1  0.85963373 -0.6874933
## 2  0.09168925 -0.1052456
## 3 -0.86783467  0.7338497
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## crim     0.03654114  0.20373943
## zn      -0.08346821  0.34784463
## indus   -0.32262409 -0.12105014
## chas    -0.04761479 -0.13327215
## nox     -0.13026254  0.15610984
## rm       0.13267423  0.44058946
## age     -0.11936644 -0.84880847
## dis      0.23454618  0.58819732
## rad     -1.96894437  0.57933028
## tax     -1.10861600  0.53984421
## ptratio -0.13087741 -0.02004405
## black    0.15432491 -0.06106305
## lstat   -0.14002173  0.14786473
## medv     0.02559139  0.37307811
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8999 0.1001
# now with arrows as in the datacamp task
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, 
                       color = "magenta", tex = 0.75, choices = c(1,2))
  {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(jitter(myscale * heads[,choices]), labels = row.names(heads),
       cex = tex, col="black", pos=3)
}

# Plot with arrows
plot(lda.fit, dimen = 2, col=km$cluster, pch=km$cluster, ylim=c(-7,7),xlim=c(-10,10))
lda.arrows(lda.fit, myscale = 5)

Two of the separated clusters could be explained by: i) high-value homes and large lots (by surface area) and ii) closeness to radial highways.

This clustering seems very appropriate, and we can be happy with this.


Chapter 5 - PCA and MCA

Seppo Nyrkkö Fri Nov 27, 2020

i. Reading the human equality data

library(ggplot2)
library(GGally)

# read human data, row names in first column


human <- read.csv("data/human.csv",row.names=1)

# examine the structure

str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ ratiofm2edu : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ ratiofmlabor: num  0.891 0.819 0.825 0.884 0.829 ...
##  $ eduexpy     : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ lifeexp     : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ gni         : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ matmort     : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ adolbirth   : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ parlrepr    : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   8
summary(human)
##   ratiofm2edu      ratiofmlabor       eduexpy         lifeexp     
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       gni            matmort         adolbirth         parlrepr    
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

The data has country as the row name, the female/male ratio in secondary education, the female/male ratio in the labor force, expected years of education, life expectancy, the nation’s GNI, maternal mortality, adolescent birth rate and the female parliament representation ratio.

Let’s see some ggpairs:

ggpairs(human)

There seems to be a positive correlation between GNI, years of education and life expectancy. Maternal mortality and adolescent birth rate correlate inversely with GNI.

hist(human$gni,breaks = seq(0,1,0.2)*max(human$gni),col='blue')

The highest GNI per capita rates are only in a few countries. Bear this in mind.
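The heavy right tail is typical of income-type variables, which are often close to log-normal. A sketch on simulated data (an assumption for illustration, not the actual GNI values):

```r
set.seed(7)
income <- rlnorm(1000, meanlog = 9, sdlog = 1)  # simulated income-like variable

# simple moment-based skewness
skew <- function(x) mean(((x - mean(x)) / sd(x))^3)

c(raw = skew(income), logged = skew(log(income)))
# raw skewness is strongly positive; after log() the distribution is near-symmetric
```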

ii. Some PCA

First, some PCA on the non-standardized data.

pc_human <- prcomp(human)
pc_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                        PC1           PC2           PC3           PC4
## ratiofm2edu  -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04
## ratiofmlabor  2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03
## eduexpy      -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02
## lifeexp      -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02
## gni          -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05
## matmort       5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03
## adolbirth     1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03
## parlrepr     -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01
##                        PC5           PC6           PC7           PC8
## ratiofm2edu  -0.0022935252  2.180183e-02  6.998623e-01  7.139410e-01
## ratiofmlabor  0.0022190154  3.264423e-02  7.132267e-01 -7.001533e-01
## eduexpy       0.1431180282  9.882477e-01 -3.826887e-02  7.776451e-03
## lifeexp       0.9865644425 -1.453515e-01  5.380452e-03  2.281723e-03
## gni          -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
## matmort       0.0266373214  1.695203e-03  1.355518e-04  8.371934e-04
## adolbirth     0.0188618600  1.273198e-02 -8.641234e-05 -1.707885e-04
## parlrepr     -0.0716401914 -2.309896e-02 -2.642548e-03  2.680113e-03
summary(pc_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000

Note the large differences in the standard deviations between components.

biplot(pc_human)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

This also gives a “not too clear” biplot. The data collapses onto a single axis: gni. (This is likely because dollars and percentages live on completely different scales.)
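The scale effect is easy to demonstrate with synthetic data (a sketch, not the human data): give one variable a much larger variance and unscaled prcomp lets it dominate PC1 completely.

```r
set.seed(3)
toy <- data.frame(small1 = rnorm(100),
                  small2 = rnorm(100),
                  big    = rnorm(100, sd = 1000))  # dollar-like scale

pc_raw <- prcomp(toy)                          # no scaling
round(pc_raw$sdev^2 / sum(pc_raw$sdev^2), 3)   # PC1 carries essentially everything

pc_std <- prcomp(toy, scale. = TRUE)           # standardized first
round(pc_std$sdev^2 / sum(pc_std$sdev^2), 3)   # variance now spread over the PCs
```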

iii. PCA with standardized data

We’ll continue by standardizing the variables and doing the PCA again.

stdpc_human <- prcomp(scale(human))
stdpc_human
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                      PC1         PC2         PC3         PC4        PC5
## ratiofm2edu  -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## ratiofmlabor  0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## eduexpy      -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## lifeexp      -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## gni          -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## matmort       0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## adolbirth     0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## parlrepr     -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                      PC6         PC7         PC8
## ratiofm2edu   0.17713316  0.05773644  0.16459453
## ratiofmlabor -0.03500707 -0.22729927 -0.07304568
## eduexpy      -0.38606919  0.77962966 -0.05415984
## lifeexp      -0.42242796 -0.43406432  0.62737008
## gni           0.11120305 -0.13711838 -0.16961173
## matmort       0.17370039  0.35380306  0.72193946
## adolbirth    -0.76056557 -0.06897064 -0.14335186
## parlrepr      0.13749772  0.00568387 -0.02306476

Exploring some dimensions:

barplot(stdpc_human$rotation[,1])
title('PC1: high maternal mortality and adolescent birth rate')

barplot(stdpc_human$rotation[,2])
title('PC2: high female labour ratio and parliament representation')

summary(stdpc_human)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000

iv. Standardized human data biplot

biplot(stdpc_human)

Now the component arrows are distinguishable. Also the biplot is clearer to read as a “map”.

The PC1 component shows the dimension where GNI is low while maternal mortality and adolescent birth rate are high. The PC2 component shows a dimension where female labor participation and parliament representation are higher.
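The “Proportion of Variance” row in the summary above is just simple arithmetic on the component standard deviations; a sketch with the printed values:

```r
# component standard deviations, as printed by summary(stdpc_human)
sdev <- c(2.0708, 1.1397, 0.87505, 0.77886, 0.66196, 0.53631, 0.45900, 0.32224)

prop_var <- sdev^2 / sum(sdev^2)
round(prop_var, 4)          # reproduces the "Proportion of Variance" row (up to rounding)
round(cumsum(prop_var), 4)  # reproduces the "Cumulative Proportion" row

# for 8 standardized variables the component variances sum to (about) 8
sum(sdev^2)
```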

v. Tea dataset

# some libraries
library(FactoMineR)
library(dplyr)
library(tidyr)

Let’s use the tea dataset, a survey of tea consumption. 300 individuals answered questions on how they drink tea (18 questions), how they perceive the product (12 questions) and some personal details (4 questions).

data(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# reduced dataset
tea_segments <- dplyr::select(tea, c("Tea", "relaxing", "effect.on.health", "sugar", "age_Q", "sex"))
str(tea_segments)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
dim(tea_segments)
## [1] 300   6
summary(tea_segments)
##         Tea             relaxing              effect.on.health      sugar    
##  black    : 74   No.relaxing:113   effect on health   : 66     No.sugar:155  
##  Earl Grey:193   relaxing   :187   No.effect on health:234     sugar   :145  
##  green    : 33                                                               
##                                                                              
##                                                                              
##    age_Q    sex    
##  15-24:92   F:178  
##  25-34:69   M:122  
##  35-44:40          
##  45-59:61          
##  +60  :38

Now let’s do Multiple Correspondence Analysis on the reduced tea data.

# some visualization
gather(tea_segments) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 10))
## Warning: attributes are not identical across measure variables;
## they will be dropped

We can see that all factor levels are populated well enough to allow further analysis of the reduced data set.

Now continuing to the MCA (which is a kind of PCA for categorical data):

mca <- MCA(tea_segments, graph = FALSE)
# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_segments, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.271   0.228   0.207   0.182   0.172   0.151   0.134
## % of var.             16.232  13.667  12.445  10.919  10.329   9.038   8.038
## Cumulative % of var.  16.232  29.899  42.344  53.263  63.592  72.630  80.667
##                        Dim.8   Dim.9  Dim.10
## Variance               0.128   0.100   0.094
## % of var.              7.658   6.015   5.659
## Cumulative % of var.  88.326  94.341 100.000
## 
## Individuals (the 10 first)
##                        Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                   |  0.266  0.087  0.030 |  0.441  0.285  0.083 |  0.533
## 2                   |  1.002  1.237  0.572 | -0.333  0.163  0.063 | -0.224
## 3                   |  0.311  0.119  0.083 | -0.564  0.465  0.273 | -0.331
## 4                   | -0.817  0.822  0.643 |  0.023  0.001  0.001 |  0.122
## 5                   |  0.237  0.069  0.044 | -0.018  0.000  0.000 | -0.227
## 6                   | -0.380  0.178  0.142 | -0.187  0.051  0.034 |  0.098
## 7                   |  0.012  0.000  0.000 |  0.001  0.000  0.000 |  0.402
## 8                   |  0.494  0.300  0.121 | -0.436  0.278  0.094 |  0.830
## 9                   |  0.012  0.000  0.000 |  0.001  0.000  0.000 |  0.402
## 10                  |  0.420  0.217  0.082 |  0.110  0.018  0.006 |  0.935
##                        ctr   cos2  
## 1                    0.457  0.122 |
## 2                    0.081  0.029 |
## 3                    0.176  0.094 |
## 4                    0.024  0.014 |
## 5                    0.083  0.040 |
## 6                    0.016  0.010 |
## 7                    0.259  0.094 |
## 8                    1.107  0.343 |
## 9                    0.259  0.094 |
## 10                   1.404  0.408 |
## 
## Categories (the 10 first)
##                         Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black               |   0.828  10.412   0.224   8.190 |   0.125   0.281   0.005
## Earl Grey           |  -0.444   7.824   0.356 -10.318 |  -0.187   1.647   0.063
## green               |   0.742   3.735   0.068   4.513 |   0.814   5.336   0.082
## No.relaxing         |   0.551   7.039   0.183   7.403 |   0.217   1.294   0.028
## relaxing            |  -0.333   4.253   0.183  -7.403 |  -0.131   0.782   0.028
## effect on health    |   0.190   0.491   0.010   1.748 |   0.417   2.799   0.049
## No.effect on health |  -0.054   0.139   0.010  -1.748 |  -0.118   0.789   0.049
## No.sugar            |   0.659  13.823   0.464  11.782 |  -0.291   3.198   0.090
## sugar               |  -0.704  14.776   0.464 -11.782 |   0.311   3.419   0.090
## 15-24               |  -0.878  14.554   0.341 -10.094 |  -0.735  12.115   0.239
##                      v.test     Dim.3     ctr    cos2  v.test  
## black                 1.235 |   1.127  25.157   0.416  11.147 |
## Earl Grey            -4.344 |  -0.330   5.628   0.196  -7.663 |
## green                 4.950 |  -0.597   3.145   0.044  -3.626 |
## No.relaxing           2.913 |  -0.725  15.921   0.318  -9.749 |
## relaxing             -2.913 |   0.438   9.621   0.318   9.749 |
## effect on health      3.829 |  -0.110   0.212   0.003  -1.006 |
## No.effect on health  -3.829 |   0.031   0.060   0.003   1.006 |
## No.sugar             -5.200 |  -0.032   0.042   0.001  -0.571 |
## sugar                 5.200 |   0.034   0.045   0.001   0.571 |
## 15-24                -8.450 |  -0.008   0.002   0.000  -0.096 |
## 
## Categorical variables (eta2)
##                       Dim.1 Dim.2 Dim.3  
## Tea                 | 0.357 0.099 0.422 |
## relaxing            | 0.183 0.028 0.318 |
## effect.on.health    | 0.010 0.049 0.003 |
## sugar               | 0.464 0.090 0.001 |
## age_Q               | 0.596 0.511 0.480 |
## sex                 | 0.013 0.589 0.020 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

This gives a rather nice plot for understanding the association between the young age group (15-24) and Earl Grey tea, the most commonly available tea blend.

Black tea is a common choice for all age groups over 35 years.

Sugar consumption is more common in the lower age groups, as the taste buds are not yet ‘driven in’, so to speak.

If designing a set of tea flavors, this graph could give a hint to create product packages and images appealing to the targeted market segments.


Chapter 6 - Longitudinal analysis

Seppo Nyrkkö Fri Dec 4, 2020

Reading the data in.

library(ggplot2)
library(dplyr)
library(tidyr)

# Rats diet weighting data in long format
longrats <- read.csv("data/ratslong.csv")

# BPRS patient treatment scores
longbprs <- read.csv("data/bprslong.csv")

i) RATS analysis

summary(longrats)
##        ID       Group         days           weight     
##  rat-1  : 11   Grp1:88   Min.   : 1.00   Min.   :225.0  
##  rat-10 : 11   Grp2:44   1st Qu.:15.00   1st Qu.:267.0  
##  rat-11 : 11   Grp3:44   Median :36.00   Median :344.5  
##  rat-12 : 11             Mean   :33.55   Mean   :384.5  
##  rat-13 : 11             3rd Qu.:50.00   3rd Qu.:511.2  
##  rat-14 : 11             Max.   :64.00   Max.   :628.0  
##  (Other):110
str(longrats)
## 'data.frame':    176 obs. of  4 variables:
##  $ ID    : Factor w/ 16 levels "rat-1","rat-10",..: 1 9 10 11 12 13 14 15 16 2 ...
##  $ Group : Factor w/ 3 levels "Grp1","Grp2",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ days  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ weight: int  240 225 245 260 255 260 275 245 410 405 ...

Quick summary of the long data (Rats):

  • Rats have unique IDs.
  • The diet group is given by the factorized Group variable
  • Time axis is numeric in days
  • Longitudinal measurements are weights
#  Some ggplots

# Display the data graphically, first by groups.

ggplot(longrats, aes(x = days, y = weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") +
  scale_y_continuous(limits = c(min(longrats$weight), max(longrats$weight)))

Seems like the grouping is done and handled correctly. Time goes in the right direction and weight grows as the days pass.

Group 1 doesn’t gain weight as much as groups 2 and 3. An outlier rat in group 2 is a bit heavier but still belongs to the correct diet group.

Since the initial weights differ so much, we can understand the effects of the diets better by standardizing the data.

# Standardize: zero mean, unit standard deviation
longrats <- longrats %>%
  group_by(days) %>%
  mutate(stdWeight = (weight - mean(weight)) / sd(weight)) %>%
  ungroup()
str(longrats)
## Classes 'tbl_df', 'tbl' and 'data.frame':    176 obs. of  5 variables:
##  $ ID       : Factor w/ 16 levels "rat-1","rat-10",..: 1 9 10 11 12 13 14 15 16 2 ...
##  $ Group    : Factor w/ 3 levels "Grp1","Grp2",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ days     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ weight   : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ stdWeight: num  -1.001 -1.12 -0.961 -0.842 -0.882 ...

This adjusts the weights relative to each day’s mean and spread. Again, graphically:

ggplot(longrats, aes(x = days, y = stdWeight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "std weight")

This looks much more promising. The control group (1) seems to stay at the center, while the two diet groups (2 and 3) seem to push the weight in some direction as time passes.
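As a check of what the grouped mutate() did, a small base-R sketch with toy weights (hypothetical numbers): standardizing within each day forces every day to mean 0 and sd 1.

```r
toy <- data.frame(day    = rep(c(1, 8), each = 4),
                  weight = c(240, 255, 410, 500,   # day 1
                             250, 270, 430, 540))  # day 8

# ave() applies the function within each day group, like group_by + mutate
toy$stdWeight <- ave(toy$weight, toy$day,
                     FUN = function(w) (w - mean(w)) / sd(w))

tapply(toy$stdWeight, toy$day, mean)  # 0 for each day
tapply(toy$stdWeight, toy$day, sd)    # 1 for each day
```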

Now calculate the group means and errors.

# some fancy pipelining 
ratgroups <- longrats %>%
  group_by(Group, days) %>%
  summarise(mean = mean(weight), se = sd(weight) / sqrt(length(weight))) %>%
  ungroup()

summary(ratgroups)
##   Group         days            mean             se        
##  Grp1:11   Min.   : 1.00   Min.   :250.6   Min.   : 3.604  
##  Grp2:11   1st Qu.:15.00   1st Qu.:269.5   1st Qu.: 4.809  
##  Grp3:11   Median :36.00   Median :486.5   Median :12.202  
##            Mean   :33.55   Mean   :424.7   Mean   :17.149  
##            3rd Qu.:50.00   3rd Qu.:518.2   3rd Qu.:34.903  
##            Max.   :64.00   Max.   :550.2   Max.   :37.129
dim(ratgroups)
## [1] 33  4

Now we get a graphical view of how the groups react to the diets as the days pass:

ggplot(ratgroups,
  aes(x = days, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  geom_point(size=3) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="2"), width=1) +
  theme(legend.position = c(0.8,0.8)) +  # legend.position takes two coordinates
  scale_y_continuous(name = "mean(weight) +/- std err")

To validate our rats, we can draw boxplots of the per-rat mean weights to visually determine whether the groups are okay or whether they contain outlier rats.

ratbox <- longrats %>%
  filter(days > 1) %>%
  group_by(Group, ID) %>%
  summarise(mean=mean(weight) ) %>%
  ungroup()
ggplot(ratbox, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=1, size=8, fill = "white", col="red") +
  scale_y_continuous(name = "mean(weight) days from 1 upwards")

summary(ratbox$Group)
## Grp1 Grp2 Grp3 
##    8    4    4

Since there are only 4 rats in groups 2 and 3, it’s really hard to say if there are any outliers. We’ll accept the one suspiciously heavy rat in group 2 and keep it in the data.

Let’s continue to the hypothesis testing then. The first question is whether the mean weight of group 1 differs from others.

# Two-sample t-test
ratbox$isGrp1 <- ratbox$Group=="Grp1"
t.test(mean ~ isGrp1, data=ratbox, var.equal=T)
## 
##  Two Sample t-test
## 
## data:  mean by isGrp1
## t = 12.628, df = 14, p-value = 4.848e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  201.416 283.834
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##             507.650             265.025

This gives a 95% confidence interval of about 202 to 284 grams for the difference in mean weight between group 1 and the other groups.
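Under the hood, t.test(var.equal = TRUE) uses the pooled two-sample t statistic. A sketch with hypothetical group means (not the actual rat values):

```r
x <- c(250, 262, 259, 268)   # hypothetical per-rat mean weights, group A
y <- c(505, 512, 498, 515)   # hypothetical per-rat mean weights, group B

# pooled variance and the classical t statistic
sp2 <- ((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
       (length(x) + length(y) - 2)
t_manual <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / length(x) + 1 / length(y)))

# agrees with the built-in test
all.equal(t_manual, unname(t.test(x, y, var.equal = TRUE)$statistic))
```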

Now adding the initial weight from the first day, before the diets started.

# Since we are interested in the growth profile of these rats we take in account
# the day 1 weight

ratinit <- longrats %>%
  filter(days == 1) %>%
  group_by(Group, ID) %>%
  summarise(initialW=mean(weight) ) %>%
  ungroup()

ratbox$initialW=ratinit$initialW

fit <- lm(mean ~ initialW + isGrp1, data=ratbox)
fit
## 
## Call:
## lm(formula = mean ~ initialW + isGrp1, data = ratbox)
## 
## Coefficients:
## (Intercept)     initialW   isGrp1TRUE  
##     86.4844       0.8751     -40.7937
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## initialW   1 253625  253625 1806.5913 2.448e-15 ***
## isGrp1     1    690     690    4.9159   0.04505 *  
## Residuals 13   1825     140                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Even after accounting for the initial weight, group 1 differs from the others at the 95% confidence level (p ≈ 0.045), so it seems to have a different growth profile. The visual examination supports this interpretation.

ii) BPRS analysis

str(longbprs)
## 'data.frame':    360 obs. of  4 variables:
##  $ treatment: Factor w/ 2 levels "T1","T2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 40 levels "G1P1","G1P10",..: 1 12 14 15 16 17 18 19 20 2 ...
##  $ weeks    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
summary(longbprs)
##  treatment    subject        weeks        bprs      
##  T1:180    G1P1   :  9   Min.   :0   Min.   :18.00  
##  T2:180    G1P10  :  9   1st Qu.:2   1st Qu.:27.00  
##            G1P11  :  9   Median :4   Median :35.00  
##            G1P12  :  9   Mean   :4   Mean   :37.66  
##            G1P13  :  9   3rd Qu.:6   3rd Qu.:43.00  
##            G1P14  :  9   Max.   :8   Max.   :95.00  
##            (Other):306

Quick summary of the long data:

  • Subjects have unique subject-IDs (factorized).
  • Group category is the treatment id (factorized)
  • Time axis is numeric in weeks
  • Measurements are bprs score
glimpse(longbprs)
## Observations: 360
## Variables: 4
## $ treatment <fct> T1, T1, T1, T1, T1, T1, T1, T1, T1, T1, T1, T1, T1, T1, T1,…
## $ subject   <fct> G1P1, G1P2, G1P3, G1P4, G1P5, G1P6, G1P7, G1P8, G1P9, G1P10…
## $ weeks     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66,…

Let’s visually analyze the BPRS study groups:

ggplot(longbprs, aes(x=weeks, y=bprs, linetype=subject)) +
  geom_line() +
  scale_linetype_manual(values=rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller=label_both) +
  scale_y_continuous(name="BPRS", limits=c(0,100)) +
  theme(legend.position="top")

Okay, this looks really good. At least the data is read in correctly, and it matches what we saw in the original BPRS survey data.

We cannot be exactly sure whether the scores go down with the treatment T1 group, but let’s continue with the analysis. The variance in T2 scores makes it difficult to visually compare the trend during the treatment weeks.

Let’s try some regression modeling

bprsreg <- lm(bprs ~ weeks + treatment, data=longbprs)
summary(bprsreg)
## 
## Call:
## lm(formula = bprs ~ weeks + treatment, data = longbprs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## weeks        -2.2704     0.2524  -8.995   <2e-16 ***
## treatmentT2   0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

This does not yet show any significant treatment effect on the group means, but the number of weeks undergone is clearly a significant variable.
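Confidence intervals make the non-significance concrete: when the interval for treatmentT2 straddles zero, no group difference is detectable. A sketch on simulated data with the same shape as longbprs (hypothetical values, no true treatment effect built in):

```r
# Simulate longbprs-like data: 40 subjects, weeks 0..8, two treatment arms
set.seed(1)
sim <- data.frame(
  weeks = rep(0:8, times = 40),
  treatment = factor(rep(c("T1", "T2"), each = 9 * 20))
)
sim$bprs <- 46 - 2.3 * sim$weeks + rnorm(nrow(sim), sd = 12)

fit <- lm(bprs ~ weeks + treatment, data = sim)
confint(fit)  # a treatmentT2 interval spanning zero = no detectable group effect
```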

Since the observations are repeated measurements from the same individuals, we should introduce some mixed models here. Let’s see what happens when we allow the unique subject ID to act as a random effect:

library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
# Random intercept model per subject:

bprs_sub <- lmer(bprs ~ weeks + treatment + (1 | subject), data=longbprs, REML = FALSE)
summary(bprs_sub)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ weeks + treatment + (1 | subject)
##    Data: longbprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2582.9   2602.3  -1286.5   2572.9      355 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.27506 -0.59909 -0.06104  0.44226  3.15835 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept) 97.39    9.869   
##  Residual             54.23    7.364   
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.3521  19.750
## weeks        -2.2704     0.1503 -15.104
## treatmentT2   0.5722     3.2159   0.178
## 
## Correlation of Fixed Effects:
##             (Intr) weeks 
## weeks       -0.256       
## treatmentT2 -0.684  0.000
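The variance components above also give the intraclass correlation, i.e. how much of the total variation is between subjects. This is plain arithmetic on the values copied from the summary:

```r
# Variance components from the random intercept model above
var_subject  <- 97.39
var_residual <- 54.23

# ICC: share of total variance attributable to between-subject differences
icc <- var_subject / (var_subject + var_residual)
round(icc, 2)  # about 0.64, so repeated scores of a subject are strongly correlated
```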
# Random intercept and random slope model per subject

bprs_subweek <- lmer(bprs ~ weeks + treatment + (weeks | subject), data=longbprs, REML = FALSE)
summary(bprs_subweek)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ weeks + treatment + (weeks | subject)
##    Data: longbprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.2   2550.4  -1254.6   2509.2      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4655 -0.5150 -0.0920  0.4347  3.7353 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 167.827  12.955        
##           weeks         2.331   1.527   -0.67
##  Residual              36.747   6.062        
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  45.9830     2.6470  17.372
## weeks        -2.2704     0.2713  -8.370
## treatmentT2   1.5139     3.1392   0.482
## 
## Correlation of Fixed Effects:
##             (Intr) weeks 
## weeks       -0.545       
## treatmentT2 -0.593  0.000

Notice the higher estimate of treatmentT2, although with a t value of only 0.48 it is still far from significant.

The subject-specific weekly slope has quite a big effect (variance around 2.3, i.e. a standard deviation of about 1.5 BPRS points per week).
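To put the slope variance in scale: with a slope SD of about 1.53 (from the summary above), roughly 95% of the subject-specific weekly slopes fall within two SDs of the mean slope. Again just arithmetic on the reported estimates:

```r
# Fixed-effect slope and random-slope SD copied from the model summary above
slope_mean <- -2.2704
slope_sd   <- 1.527

# Approximate 95% range of subject-specific slopes (mean +/- 2 SD)
slope_mean + c(-2, 2) * slope_sd  # about -5.3 to +0.8 BPRS points per week
```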

Now let’s compare these two random effects models with a likelihood ratio test via anova():

anova(bprs_sub, bprs_subweek)
## Data: longbprs
## Models:
## bprs_sub: bprs ~ weeks + treatment + (1 | subject)
## bprs_subweek: bprs ~ weeks + treatment + (weeks | subject)
##              npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## bprs_sub        5 2582.9 2602.3 -1286.5   2572.9                         
## bprs_subweek    7 2523.2 2550.4 -1254.6   2509.2 63.663  2  1.499e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can conclude that the model which takes both the subject-specific intercept and the subject-specific slope as random effects fits the data significantly better (p < 0.001)!

Note, however, that this test compares the two model structures; it says nothing about the treatment groups themselves. The fixed treatment effect remains non-significant in both models.
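The chi-squared statistic in the anova() table is simply the drop in deviance between the two models; replicating it by hand (deviances copied from the summaries above):

```r
# Deviances of the two models, copied from the output above
dev_intercept <- 2572.9  # random intercept only
dev_slope     <- 2509.2  # random intercept + random slope

# Likelihood ratio test: deviance drop vs chi-squared with 2 extra parameters
chisq <- dev_intercept - dev_slope
p_value <- pchisq(chisq, df = 2, lower.tail = FALSE)
c(chisq = chisq, p = p_value)
```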

Let’s plot the fitted curves to examine this visually:

# Compute the fitted values 
Fitted <- fitted(bprs_subweek)

# Add a new column for the fitted data
bprslongfit <- mutate(longbprs, bprs=Fitted)

# Draw a plot of bprs (fitted and observed)
ggplot(bprslongfit, aes(x=weeks, y=bprs, group=subject)) +
  geom_line(color="blue") +
  geom_line(data=longbprs,color="red") +
  scale_x_continuous(name="weeks") +
  facet_grid(. ~ treatment, labeller=label_both) +
  scale_y_continuous(name="BPRS fitted (blue) vs observed (red)") +
  theme(legend.position="top")

This comparison illustrates the random effects rather well: each subject gets its own fitted intercept and slope. The fitted lines also make the overall BPRS levels of the two treatment groups easier to compare, even though the mixed effects model estimated only a small, non-significant difference between them.
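A single number summarising how closely the fitted lines track the observations is the root mean squared error between fitted and observed scores; a sketch with toy vectors (hypothetical stand-ins for longbprs$bprs and Fitted):

```r
# Toy observed and fitted score vectors (hypothetical stand-ins)
observed  <- c(42, 36, 36, 58, 68, 61)
predicted <- c(44, 39, 34, 56, 64, 63)

# Root mean squared error of the fit
rmse <- sqrt(mean((observed - predicted)^2))
round(rmse, 2)
```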